Dr. Katherine Shoemaker
2023-06-17
Mosquitoes are one of the most dangerous insects. They are found almost everywhere around the world. Mosquitoes transmit various diseases to human by biting and drawing blood. The only way to protect yourself is to avoid bites from infected mosquitoes. However, in order to be vigilant, we need to better understand its behavior and reproductive density and development. The dataset “Arthropod Master Report 2015_2020” was collected from traps designed and placed throughout Houston over the period of 2015-2020. We decide to use this dataset to analyze mosquito’s biological behavior, such as the growth, in Houston environment.
There are 173743 observations of 39 variables in this data set. However, result of many traps were sent to many Labs at the same time and repeated in the data set. It caused 5574 duplicated rows occurred if we delete ‘LabNo’ column. After cleaning the data, we have 168169 observations of 38 variables.
## [1] 168169 38
We split data into many subsets to analyze which factor is significant on the amount of mosquitoes collected.
The first subset is about the amount of mosquitoes over time. We select time (DateTime), type of mosquitoes (SpeciesAbbr1), the amount of female mosquitoes (TotFemCount1), the amount of male mosquitoes (TotMaleCount1), and type of trap they desgined (TrapType). We use this code below to clean this subset of data.
data1 <- data %>%
select(DateTime, SpeciesAbbr1,
TotFemCount1, TotMaleCount1, TrapType) %>%
na.omit()
data1 <- data.frame(data1)
attach(data1)## The following objects are masked from data:
##
## DateTime, SpeciesAbbr1, TotFemCount1, TotMaleCount1, TrapType
The amount of female species is 4686179
The amount of male species is: 639737
The total amount of species which were collected is: 5325916
The rate of female species is: 88%
The rate of male species is: 12%
The total amount of types of species which were collected is: 42
data1_1 <- data1 %>% summarise(total_female = sum(TotFemCount1),
total_male = sum(TotMaleCount1))
print.data.frame(data1_1,row.names=F)## total_female total_male
## 4686179 639737
There are 4686179 female mosquitoes and 639737 male mosquitoes collected.
This produces the bar chart with the counts of the type of mosquitoes for each gender.What can be clearly seen in this chart is Cxqf is the highest type of mosquitoes which were collected. And the number of female mosquitoes is much higher than the male in general.
In the summer and until the end of October, the number of mosquitoes rise up dramatically. It can be said that this season of the year in Houston is when mosquitoes are most active.
After grouping data by the Trap Type, we identify that BG, GV and SS are 3 main types of trap they used to collect mosquitoes in Houston.
## # A tibble: 18 × 3
## # Groups: weekday [6]
## weekday TrapType total
## <dbl> <chr> <dbl>
## 1 1 BG 744
## 2 1 GV 1571
## 3 1 SS 607
## 4 3 BG 465911
## 5 3 GV 726150
## 6 3 SS 638553
## 7 4 BG 317779
## 8 4 GV 664427
## 9 4 SS 410007
## 10 5 BG 344455
## 11 5 GV 632789
## 12 5 SS 349837
## 13 6 BG 134997
## 14 6 GV 377307
## 15 6 SS 246593
## 16 7 BG 7241
## 17 7 GV 4434
## 18 7 SS 2514
And this chart below is a comparison between the number of mosquitoes from 3 trap types by day of the week.
However, only 6 points represent for 6 days of the week on each line. The traps were just collected on 6 days and the collector take a break on Tuesday.
Over the five-year period, there is always a significant difference between the number of male and female mosquitoes. Female species account for a much larger number and tend to decrease rapidly from 2017 to 2019.
It can be seen that the fluctuations of the number of female samples and of the total number of collected mosquitoes tend to be almost the same. They both increased during the beginning of the period and decreased substantially since 2017.
After going through the steps of collection and classification, the research Labs will begin to examine the common diseases that the mosquito species can spread. They are marked Positive if they carry the disease. This table below is the statistics of the amount of Positive mosquitoes each year.
## # A tibble: 6 × 2
## Year total_mosquitoes
## <dbl> <dbl>
## 1 2015 40103
## 2 2016 8581
## 3 2017 8890
## 4 2018 26616
## 5 2019 3612
## 6 2020 2033
The second subset is about the amount of mosquitoes over the area of Houston (Zipcode map). However, in the data set, there are 18 observations which show mosquitoes are collected in Zipcode 7737. We will create a graph of Zipcode without that strange zipcode. We use this code below to clean this subset of data.
data2 <- data %>% filter(Zip!=7737) %>% group_by(Zip) %>% na.omit() %>%
mutate(total_mosquitoes = sum(sum(TotFemCount1),sum(TotMaleCount1))) ggplot(heat_map_of_Houston) + geom_sf(aes(fill = total_mosquitoes)) + scale_fill_viridis_c(option = "inferno" , begin = 0.1, label = scales::comma) + stat_sf_coordinates() + theme(legend.position="right",
plot.title = element_text(hjust = 0.5,color = "dark red", size = 16, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, color = "dark red", size = 14, face = "bold"),
plot.caption = element_text(color = "Gray60"))+
guides(fill = guide_legend(title = "Mosquitoes Count", title.position = "top",
title.theme =element_text(size = 10, face = "bold",
colour = "Black",angle = 0))) +
geom_sf_text(aes(label = ZCTA5CE20), colour = "grey", alpha = 0.6) +
labs(title = "Mosquitoes Count by Zipcode in Houston", subtitle = "2015 to 2020")A point worth noting is the further away from the center, the higher the amount of mosquitoes is. We could understand that the habitat in the suburbs of Houston is a suitable place for mosquitoes to live and grow, as well as a place with a high risk of mosquito-borne diseases.
So, why are mosquitoes a serious year-around problem in Houston? Is Houston weather one of the factors involved in this proliferation?
We analyze the association between Houston weather and mosquito populationis by combining this data set and the Weather data set. The Weather Data is taken from The National Oceanic and Atmospheric Administration (NOAA)’s Climate Data Online website and at the location “HOUSTON NATIONAL WEATHER SERVICE OFFICE, TX US” with the time coincident with the time of Mosquitoes Data.
## [1] "STATION" "NAME" "LATITUDE" "LONGITUDE" "ELEVATION" "DATE"
## [7] "AWND" "DAPR" "MDPR" "PGTM" "PRCP" "PSUN"
## [13] "SNOW" "SNWD" "TAVG" "TMAX" "TMIN" "TOBS"
## [19] "WDF2" "WDF5" "WESD" "WESF" "WSF2" "WSF5"
## [25] "WT01" "WT02" "WT03" "WT04" "WT05" "WT06"
## [31] "WT07" "WT08" "WT10"
## [1] "2015-01-01" "2020-12-31"
data4 <-weather %>% group_by(DATE) %>% filter (NAME == 'HOUSTON NATIONAL WEATHER SERVICE OFFICE, TX US') %>% select(DATE, NAME, PRCP, LATITUDE, LONGITUDE) data4$Date <- as.Date(data4$DATE)
# Merge the data sets based on date-time column
merged_data <- data %>% inner_join(data4, by=c('Date'))
merged_data <- merged_data %>% select(SpeciesAbbr1, TotFemCount1, TotMaleCount1, Date, DateTime, PRCP) %>%
group_by(Date, PRCP) %>%
summarise(total = sum(TotFemCount1) + sum(TotMaleCount1)) %>%
na.omit(total)We use simple linear regression to predict the amount of mosquito species on the basis of the precipitation rate of Houston at the specified time.
##
## Call:
## lm(formula = total ~ PRCP, data = merged_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4662.3 -3241.5 -841.5 2179.5 23968.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4634.86 144.74 32.021 <2e-16 ***
## PRCP 92.07 233.77 0.394 0.694
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4040 on 850 degrees of freedom
## Multiple R-squared: 0.0001824, Adjusted R-squared: -0.0009938
## F-statistic: 0.1551 on 1 and 850 DF, p-value: 0.6938
However, p-value is approximately 0.69 > 0.05 (alpha), we failed to reject the null hypothesis. The rate of the precipitation in Houston is not significant of predictor of the amount of mosquitoes.
We check the normality of the model to see if it meets the requirement of other test how well the model fits our data.
However, by visualizing, a Q-Q plot shows the points in the plot does not fall along a straight diagonal line. Therefore, this data is not normally distributed. It means that the precipitation rate of Houston does not affect on the amount of mosquitoes on that date.
Lagging-time series is the method we use to get the autocorrection between 2 time series of Weather Data and Mosquitoes Data by delaying 1 time series.
Definition: The coefficient of correlation between two values in a time series is called the autocorrelation function (ACF). This value of k is the time gap being considered and is called the lag. A lag 1 autocorrelation (i.e., k = 1 in the above) is the correlation between values that are one time period apart. More generally, a lag k autocorrelation is the correlation between values that are k time periods apart. - by Iain Pardoe in book “Applied Regression Modeling - 3rd edition”
We use lag() function to shift the time series back 3-7 days to get the coefficient of correlation of the precipitation few days ago with the amount of mosquitoes at time. We predict mosquitoes species based on the the precipitation rate of 3-7 days ago.
merged_data$PRCP_lag3 <- lag(merged_data$PRCP, 3)
merged_data$PRCP_lag4 <- lag(merged_data$PRCP, 4)
merged_data$PRCP_lag5 <- lag(merged_data$PRCP, 5)
merged_data$PRCP_lag6 <- lag(merged_data$PRCP, 6)
merged_data$PRCP_lag7 <- lag(merged_data$PRCP, 7)
model <- lm(total~PRCP_lag3+PRCP_lag4+PRCP_lag5+PRCP_lag6+PRCP_lag7, merged_data)
summary(model)##
## Call:
## lm(formula = total ~ PRCP_lag3 + PRCP_lag4 + PRCP_lag5 + PRCP_lag6 +
## PRCP_lag7, data = merged_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5700.9 -3208.9 -772.9 2272.5 24295.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4302.3 159.4 26.989 <2e-16 ***
## PRCP_lag3 427.5 236.2 1.810 0.0706 .
## PRCP_lag4 282.9 241.1 1.173 0.2410
## PRCP_lag5 266.1 241.0 1.104 0.2699
## PRCP_lag6 574.8 241.1 2.384 0.0173 *
## PRCP_lag7 558.1 236.2 2.363 0.0184 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3990 on 839 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.0277, Adjusted R-squared: 0.0219
## F-statistic: 4.78 on 5 and 839 DF, p-value: 0.0002583
p-value of the model 0.0002583 is less than alpha 0.05, we reject the null hypothesis. Therefore, there is statistically significant relationship between the number of mosquitoes and PRCP delayed from 3-7 days. However, p-value of PRCP_lag6 and 7 are less than alpha 0.05. It means that the precipitation rate of 6 or 7 days before will affect on the population of mosquitoes at the time of prediction.
#Check accuracy
# Split the data into training and testing sets
set.seed(123123)
trainIndex <- createDataPartition(merged_data$total, p = .8, list = FALSE)
train <- merged_data[trainIndex, ]
test <- merged_data[-trainIndex, ]
# Fit the multiple linear regression model
new_model <- lm(total ~ PRCP_lag3 + PRCP_lag4 + PRCP_lag5 + PRCP_lag6 + PRCP_lag7, data = train)
# Make predictions on the test set
predictions <- predict(new_model, newdata = test)
# Compute the accuracy metrics
MSE <- mean(na.omit((test$total - predictions)^2))
RMSE <- sqrt(MSE)
MAE <- mean(abs(na.omit(test$total - predictions)))
R2 <- cor(test$total, na.omit(predictions))^2
MSE## [1] 17902651
## [1] 4231.152
## [1] 3194.094
## [1] 0.02038507
The R-squared value represents the proportion of variance in the response variable that is explained by the predictor variable. Higher values indicate a better fit of the model. R-squared value of this model is 0.0203850682546335. It is not bad, but not really good number. It needs to be closer to 1 to get better fit for our model.
The MSE, RMSE, and MAE represent the average squared 17902650.6670992, square-rooted 4231.15240414467, and absolute differences 3194.09426819416 between the predicted and actual values, respectively. Lower values indicate better accuracy of the model’s predictions.
The Q-Q plot of this model shows the points in the plot does not fall along a straight diagonal line. Therefore, this data is not normally distributed. We need to improve the model to get the best fit.
## Start: AIC=14018.79
## total ~ PRCP_lag3 + PRCP_lag4 + PRCP_lag5 + PRCP_lag6 + PRCP_lag7
##
## Df Sum of Sq RSS AIC
## - PRCP_lag5 1 19404358 1.3378e+10 14018
## - PRCP_lag4 1 21916183 1.3380e+10 14018
## <none> 1.3359e+10 14019
## - PRCP_lag3 1 52164367 1.3411e+10 14020
## - PRCP_lag7 1 88891372 1.3447e+10 14022
## - PRCP_lag6 1 90493351 1.3449e+10 14022
##
## Step: AIC=14018.02
## total ~ PRCP_lag3 + PRCP_lag4 + PRCP_lag6 + PRCP_lag7
##
## Df Sum of Sq RSS AIC
## <none> 1.3378e+10 14018
## - PRCP_lag4 1 32221507 1.3410e+10 14018
## + PRCP_lag5 1 19404358 1.3359e+10 14019
## - PRCP_lag3 1 51717614 1.3430e+10 14019
## - PRCP_lag7 1 88306684 1.3466e+10 14022
## - PRCP_lag6 1 112509798 1.3490e+10 14023
##
## Call:
## lm(formula = total ~ PRCP_lag3 + PRCP_lag4 + PRCP_lag6 + PRCP_lag7,
## data = merged_data)
##
## Coefficients:
## (Intercept) PRCP_lag3 PRCP_lag4 PRCP_lag6 PRCP_lag7
## 4332.1 425.7 336.1 628.0 556.3
After using step() function to get the better fit, AIC does not change much, but we still test the new model to see if the model is improved or not.
##
## Call:
## lm(formula = total ~ PRCP_lag3 + PRCP_lag4 + PRCP_lag6 + PRCP_lag7,
## data = merged_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5726.7 -3235.1 -788.8 2287.0 24265.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4332.1 157.1 27.571 < 2e-16 ***
## PRCP_lag3 425.7 236.2 1.802 0.07190 .
## PRCP_lag4 336.1 236.3 1.422 0.15529
## PRCP_lag6 628.0 236.3 2.658 0.00801 **
## PRCP_lag7 556.3 236.2 2.355 0.01877 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3991 on 840 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.02628, Adjusted R-squared: 0.02165
## F-statistic: 5.669 on 4 and 840 DF, p-value: 0.0001668
The new Adjusted R-squared is 0.02165 is even smaller than the previous model’s R-square (0.0219). We could not use this model as a final model.
Another way is Outlier Detection and Removal by using Cook’s distance.
#calculate Cook's distance
cd <- cooks.distance(model)
#Identify the influential observations using a threshold value of Cook's distance
threshold <- 4/nrow(merged_data)
influential <- which(cd > threshold)
#Remove the influential observations from the data set
merged_data_clean <- merged_data[-influential,]
#Fit the model again
model_clean <- lm(total ~ PRCP_lag3 + PRCP_lag4 + PRCP_lag6 + PRCP_lag7, data = merged_data_clean)
summary(model_clean)##
## Call:
## lm(formula = total ~ PRCP_lag3 + PRCP_lag4 + PRCP_lag6 + PRCP_lag7,
## data = merged_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5837.9 -3259.2 -780.7 2317.0 24307.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4289.7 160.7 26.692 <2e-16 ***
## PRCP_lag3 524.3 244.8 2.142 0.0325 *
## PRCP_lag4 370.2 247.9 1.494 0.1357
## PRCP_lag6 562.4 252.6 2.227 0.0262 *
## PRCP_lag7 542.1 240.9 2.251 0.0247 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3998 on 800 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.02628, Adjusted R-squared: 0.02142
## F-statistic: 5.399 on 4 and 800 DF, p-value: 0.0002717
Its R-squared still decreases. So, we could give conclude that we should use all the lagged-time series of the precipitation from 3-7 days to predict the total amount of mosquitoes.
Why should we use lagged-time series?
We choose a specified period from November 2020 to the end of that year and make a chart to see how significant the total amount of collected mosquitoes with the original precipitation and the lagged-time series precipitation.
merged_data %>% filter(Date >= as.Date("2020-11-01") &
Date <= as.Date("2020-12-31")) %>%
select(Date, total, PRCP, PRCP_lag6,PRCP_lag7)## # A tibble: 18 × 5
## # Groups: Date [18]
## Date total PRCP PRCP_lag6 PRCP_lag7
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-11-04 7402 0 0.07 0
## 2 2020-11-05 7119 0 0 0.07
## 3 2020-11-10 13842 0 0 0
## 4 2020-11-12 12208 0 0.26 0
## 5 2020-11-13 4678 0 0.09 0.26
## 6 2020-11-17 4705 0 0 0.09
## 7 2020-11-18 6667 0 0 0
## 8 2020-11-19 6797 0 0 0
## 9 2020-11-20 3690 0 0 0
## 10 2020-11-24 5785 0.17 0 0
## 11 2020-11-25 4544 0 0 0
## 12 2020-12-04 192 0 0 0
## 13 2020-12-08 2635 0 0 0
## 14 2020-12-09 4439 0 0 0
## 15 2020-12-10 7147 0 0 0
## 16 2020-12-11 2409 0.28 0.17 0
## 17 2020-12-16 118 0 0 0.17
## 18 2020-12-17 48 0 0 0
As it is observed, the volatility of the original PRCP does not reflect on the mosquito population at all. However, the PRCP_lag6 and lag7 almost overlap the bar columns, and have the same trend of the fluctuation as the total amount of mosquitoes.
To sum up, the number of mosquitoes in Houston is extremely large and needs to be analyzed and understood to enhance the awareness of the dangers that mosquitoes bring. Depending on the environment conditions such as terrain, weather, etc., mosquitoes reproduce and develop in different numbers. In particular, Houston’s weather, which has high humidity, is a favorable condition for mosquitoes to grow. However, the amount of mosquitoes that will be affected and fluctuates is almost based on the precipitation of the previous 6-7 day period. This may be referred to as the precention period or the mosquito prediction period.